Abstract

Learning an input-output mapping from a set of examples, of the type that many 
neural networks have been constructed to perform, can be regarded as synthesizing an 
approximation of a multi-dimensional function. From this point of view, this form of 
learning is closely related to regularization theory. The theory developed in Poggio 
and Girosi (1989) shows the equivalence between regularization and a class of three- 
layer networks that we call regularization networks or Hyper Basis Functions. These 
networks are not only equivalent to generalized splines, but are also closely related to 
the classical Radial Basis Functions used for interpolation tasks and to several pat- 
tern recognition and neural network algorithms. In this note, we extend the theory 
by introducing ways of dealing with two aspects of learning: learning in the presence 
of unreliable examples and learning from positive and negative examples. These two 
extensions are interesting also from the point of view of the approximation of multi- 
variate functions. The first extension corresponds to dealing with outliers among the 
sparse data. The second one corresponds to exploiting information about points or 
regions in the range of the function that are forbidden. 
1 Introduction

In previous papers (Poggio and Girosi, 1989, 1990) we have shown the equiv- 
alence between regularization and a class of three-layer networks that we 
called regularization networks and that are related to the classical interpola- 
tion technique of Radial Basis Functions. 

Let $g = \{(x_i, y_i) \in R^n \times R\}_{i=1}^N$ be a set of data that we
want to approximate by means of a function $f$. The regularization approach
(Tikhonov, 1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986)
selects the function $f$ that solves the variational problem of minimizing
the functional

$$H[f] = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \|Pf\|^2 \tag{1}$$



where $P$ is a constraint operator (usually a differential operator),
$\|\cdot\|$ is a norm on the function space to which $Pf$ belongs (usually
the $L^2$ norm) and $\lambda$ is a positive real number, the so-called
regularization parameter. The structure of the operator $P$, which is called
the "stabilizer", embodies the a priori knowledge about the solution, and
therefore depends on the nature of the particular problem that has to be
solved. We have shown (Poggio and Girosi, 1989) that the solution of the
variational problem (1) has the following simple form:

$$f(x) = \sum_{i=1}^N c_i G(x; x_i) + p(x)$$



where $G(x)$ is the Green's function (Stakgold, 1979) of the self-adjoint
differential operator $\hat{P}P$, $\hat{P}$ being the adjoint operator of
$P$, $p(x)$ is a linear combination of functions that span the null space of
$P$, and the coefficients $c_i$ satisfy a linear system of equations that
depends on the $N$ "examples", i.e. the data to be approximated. The form of
the term $p(x)$ depends on the stabilizer that has been chosen and on the
boundary conditions, and therefore on the particular problem that has to be
solved (for instance, it is not needed in the case of $P$ corresponding to a
Gaussian or bell-shaped Green's function). For this reason, and since its
inclusion does not modify the main conclusions, we will disregard it in the
following. In the special case in which $P$ is an operator with radial
symmetry, the Green's function $G$ is radial and therefore the approximating
function becomes:

$$f(x) = \sum_{i=1}^N c_i G(\|x - x_i\|^2), \tag{2}$$

which is a sum of radial functions, each with its center $x_i$ on a distinct data
point. Thus the number of radial functions, and corresponding centers, is 
the same as the number of examples. 
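
As a minimal sketch of equation (2), the following Python fragment evaluates
a sum of radial functions, one centered on each data point. The Gaussian form
of the Green's function, the `width` parameter and all function names are
illustrative assumptions, not taken from the text.

```python
import math

def gaussian_green(r2, width):
    """Assumed radial Green's function: a Gaussian of the squared distance r2."""
    return math.exp(-r2 / (2.0 * width ** 2))

def rbf_expansion(x, centers, coeffs, width=0.3):
    """Evaluate f(x) = sum_i c_i G(||x - x_i||^2) for points given as tuples."""
    total = 0.0
    for center, c in zip(centers, coeffs):
        r2 = sum((a - b) ** 2 for a, b in zip(x, center))
        total += c * gaussian_green(r2, width)
    return total

# One radial unit per example: three 2-D centers, three coefficients.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
coeffs = [1.0, -0.5, 0.25]
value_at_origin = rbf_expansion((0.0, 0.0), centers, coeffs)
```

The number of terms in the loop equals the number of examples, which is the
point made in the text: each example contributes one center.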

In this note we indicate how to extend our theory of learning from examples
in order to deal with 1) the occurrence of unreliable examples, 2) negative
examples. Both problems are also interesting from the point of view of
classical approximation theory:

1. discounting "bad" examples corresponds to discarding, in the approxi- 
mation of a function, data points that are outliers. 

2. learning by using negative examples - in addition to positive ones - 
corresponds to approximating a function based not only on points to 
which the function must be close but also on points - or regions - that 
the curve associated with the function must avoid. 



2 Unreliable data 

Suppose that the set $g = \{(x_i, y_i) \in R^n \times R\}_{i=1}^N$ of data
has been obtained by randomly sampling a function $f$, defined on $R^n$, in
the presence of noise. We are interested in recovering the function $f$, or
an estimate of it, from the set of data $g$. We take a probabilistic
approach, and regard the function $f$ and the data $g$ as random, dependent
variables. Using Bayes theorem, it is possible to express the conditional
probability $\mathcal{P}[f|g]$ of the function $f$ given the examples $g$ in
terms of the a priori probability of $f$, $\mathcal{P}[f]$, and the
conditional probability of $g$ given $f$, $\mathcal{P}[g|f]$, which is
equivalent to a model of the noise:

$$\mathcal{P}[f|g] \propto \mathcal{P}[g|f]\,\mathcal{P}[f]. \tag{3}$$

If the noise is Gaussian the probability $\mathcal{P}[g|f]$ can be written as:

$$\mathcal{P}[g|f] \propto e^{-\sum_{i=1}^N \beta_i (y_i - f(x_i))^2} \tag{4}$$

where $\beta_i = \frac{1}{2\sigma_i^2}$ and $\sigma_i^2$ is the variance of
the noise related to the i-th data point. Under some assumption on the
stochastic process $f$ (Marroquin et al., 1987; Geman and Geman, 1984) it is
possible to write the a priori probability $\mathcal{P}[f]$ in the following
way:

$$\mathcal{P}[f] \propto e^{-\lambda \|Pf\|^2}$$

where $P$ is a constraint operator (usually a differential operator),
$\|\cdot\|$ is a norm on the function space to which $Pf$ belongs (usually
the $L^2$ norm) and $\lambda$ is a positive real number. This form of
probability distribution gives high probability only to those functions for
which the term $\|Pf\|^2$ is small, and embodies the a priori knowledge that
one has about the system. For example, if one knows that the function $f$
that has been sampled is very smooth, in the sense that it does not vary too
"quickly" in its domain, the operator $P$ will be a differential operator of
high degree.

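Using Bayes theorem (3) the a posteriori probability of $f$ can be written
as

$$\mathcal{P}[f|g] \propto e^{-\left(\sum_{i=1}^N \beta_i (y_i - f(x_i))^2 + \lambda \|Pf\|^2\right)}. \tag{5}$$
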
A simple way to obtain an estimate of the function $f$ from the probability
distribution (5) consists in taking the so-called MAP (Maximum A Posteriori)
estimate, that is the function that maximizes the a posteriori probability
$\mathcal{P}[f|g]$, or minimizes the exponent in equation (5). Setting for
simplicity all the variances $\sigma_i^2$ equal to one fixed variance
$\sigma^2$, and defining from here on

$$\Delta_i = y_i - f(x_i),$$

the MAP estimate of $f$ is then the minimum of the following functional:

$$H_0[f] = \frac{1}{2\sigma^2}\sum_{i=1}^N V(\Delta_i) + \lambda \|Pf\|^2 \tag{6}$$

where we have defined the quadratic function

$$V(x) = x^2 .$$

This is equivalent to the so-called "regularization technique" (Tikhonov,
1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986) that has
been extensively used in order to solve ill-posed problems, of which this is a
particular example. The parameter $\lambda$, which is usually called the
"regularization parameter", determines the trade-off between the level of the
noise and the strength of the assumptions about the solution, therefore
controlling the compromise between the degree of smoothness of the solution
and its closeness to the data.

In the approach outlined here we have assumed that the variance of the noise
associated with each data point is known, but this assumption is not always
realistic. Sometimes we know that some of the data can be affected by a
high amount of noise, or can be completely wrong. In order to deal with this
situation, we regard the variances of the noise, as well as the unknown
function, as random variables. Of course, some a priori knowledge about these
variables, represented by an appropriate a priori probability distribution, is
needed. Let us denote by $\beta$ the set of random variables
$\{\beta_i\}_{i=1}^N$. By means of Bayes theorem we can compute the joint
probability of the function $f$ and of the set $\beta$:

$$\mathcal{P}[f, \beta | g] \propto \mathcal{P}[g | f, \beta]\,\mathcal{P}[f]\,\mathcal{P}[\beta] \tag{7}$$

where $\mathcal{P}[g|f, \beta]$ is the same as in equation (4) and
$\mathcal{P}[\beta]$ is the a priori probability of the set of variances
$\beta$. The model above, which leads to standard regularization, is recovered
by setting

$$\mathcal{P}[\beta] = \prod_{i=1}^N \delta(\beta_i - \beta_i^*)$$

where $\beta_i^*$ are some fixed values. Depending on the a priori knowledge
on $\beta$, different models may arise, corresponding to different choices of
$\mathcal{P}[\beta]$. Here we consider the following situation: we have
knowledge that a certain percentage, $\epsilon$, of data is spurious (we will
call them "outliers") whereas a percentage $(1 - \epsilon)$ is characterized
by a Gaussian noise distribution with inverse-variance parameter $\beta^*$.
Therefore there are only two possibilities: $\beta_i = \beta^*$, for the
"true" data points, and $\beta_i = 0$, for the outliers. This situation leads
to choosing the following probability distribution:

$$\mathcal{P}[\beta] = \prod_{i=1}^N \left[(1 - \epsilon)\,\delta(\beta_i - \beta^*) + \epsilon\,\delta(\beta_i)\right]. \tag{8}$$

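Given the a posteriori probability (7) we are mainly interested in computing
an estimate of $f$. Thus what we really need to compute is the marginal
posterior probability of $f$, $\mathcal{P}_m[f]$, which is obtained by
integrating equation (7) over the variables $\beta_i$:

$$\mathcal{P}_m[f] = \int_0^\infty \prod_{i=1}^N d\beta_i \, \mathcal{P}[f, \beta | g] .$$

Using the model for $\mathcal{P}[\beta]$ described by equation (8) we obtain:

$$\mathcal{P}_m[f] \propto e^{-\lambda \|Pf\|^2} \prod_{i=1}^N \int_0^\infty dx \, e^{-x \Delta_i^2} \left[(1 - \epsilon)\,\delta(x - \beta^*) + \epsilon\,\delta(x)\right] .$$

The integral yields

$$\mathcal{P}_m[f] \propto e^{-\lambda \|Pf\|^2} \prod_{i=1}^N \left[\frac{1 - \epsilon}{\epsilon}\, e^{-\beta^* \Delta_i^2} + 1\right] .$$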






In order to make clear the meaning of such a marginal probability
distribution we rewrite it as:

$$\mathcal{P}_m[f] \propto e^{-\left(\beta^* \sum_{i=1}^N V_{eff}(\Delta_i) + \lambda \|Pf\|^2\right)}$$

where we have defined the effective potential

$$V_{eff}(x) = x^2 - \frac{1}{\beta^*} \ln\left(1 + e^{\beta^* x^2 - \gamma}\right)$$

and we have set $\gamma = \ln\frac{1 - \epsilon}{\epsilon}$. The MAP estimate
for $f$ given by this probability distribution is obtained by minimizing the
functional

$$H[f] = \beta^* \sum_{i=1}^N V_{eff}(\Delta_i) + \lambda \|Pf\|^2 . \tag{9}$$
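
The step from the product form of the marginal probability to its exponential
form can be spelled out as follows. This is a reconstruction that assumes
$\gamma = \ln\frac{1-\epsilon}{\epsilon}$ and an effective potential of the
form $V_{eff}(x) = x^2 - \frac{1}{\beta^*}\ln(1 + e^{\beta^* x^2 - \gamma})$;
constant factors are absorbed in the proportionality sign.

```latex
\begin{align*}
(1-\epsilon)\,e^{-\beta^*\Delta_i^2} + \epsilon
  &= \epsilon\left[1 + e^{\gamma-\beta^*\Delta_i^2}\right] \\
  &= (1-\epsilon)\, e^{-\beta^*\Delta_i^2}\left[1 + e^{\beta^*\Delta_i^2-\gamma}\right], \\
-\ln\left[(1-\epsilon)\,e^{-\beta^*\Delta_i^2} + \epsilon\right]
  &= \beta^*\Delta_i^2 - \ln\left(1 + e^{\beta^*\Delta_i^2-\gamma}\right) - \ln(1-\epsilon) \\
  &= \beta^* V_{eff}(\Delta_i) + \text{const}.
\end{align*}
```

Summing over the data points and adding the smoothness term then gives the
exponent of the marginal probability, up to an additive constant that does
not affect the minimization.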

The introduction of the random variables $\beta_i$ leads, therefore, to a new
minimization problem. Let us compare the functionals (9) and (6). The
functional (9) is similar to the standard regularization functional (6), the
only difference being in the data term. In the standard regularization
functional the data term consists of the sum over all the data of a quadratic
function $V$ of the interpolation error $\Delta_i$, and its role is to enforce
closeness of the solution to the data. In the last case the quadratic function
$V$ has been substituted by the function $V_{eff}$, depicted in figures (1)
and (2), whose shape depends on the parameters $\beta^*$ and $\epsilon$.

Figure (1) shows the effective potential for different values of $\epsilon$,
and for $\beta^* = 1.0$. In the case of $\epsilon = 0$ we obviously recover
the regularization model, since

$$\lim_{\epsilon \to 0} V_{eff}(x) = V(x) = x^2 .$$

When $\epsilon$ is different from zero $V_{eff}(x)$ has two different
behaviours: quadratic in a neighborhood of the origin, and constant far away
from it. The effect of this behavior is clear: closeness to the data is
enforced only when the interpolation error is small. In particular we notice
that, for $x \to 0$:

$$V_{eff}(x) \simeq V_{eff}(0) + (1 - \epsilon)\, x^2 .$$

When $\epsilon$ increases and approaches 1 the effective potential becomes
flatter and flatter, which is equivalent to the effective variance of the
noise becoming larger and larger.

Let us consider the case of positive values of $\gamma$, which corresponds to
values of $\epsilon$ smaller than 0.5. This is the usual case, since
$1 - \epsilon$ represents the percentage of "true" data points. In the limit
of $\beta^* \to \infty$ the effective potential $V_{eff}$ is quadratic if the
absolute value of its argument is smaller than $\sqrt{\gamma/\beta^*}$ and
constant otherwise (fig. 2). This corresponds to the situation in which we
have "true" data points without noise: therefore data points are considered
reliable if the interpolation error is smaller than a threshold
($\sqrt{\gamma/\beta^*}$) and their contribution is neglected otherwise. In
the case of negative values of $\gamma$, which is the case of a percentage of
outliers greater than 50%, the effective potential, which is already flat,
becomes even flatter when $\beta^*$ increases. This case is not very
interesting and in the following we will always make the assumption that
$\gamma > 0$, that is $\epsilon < 0.5$.
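
The behaviour of the effective potential described above can be checked
numerically. The sketch below assumes the form
$V_{eff}(x) = x^2 - \frac{1}{\beta^*}\ln(1 + e^{\beta^* x^2 - \gamma})$ with
$\gamma = \ln\frac{1-\epsilon}{\epsilon}$; the function names and the sample
values are illustrative choices.

```python
import math

def softplus(u):
    """Numerically stable ln(1 + e^u)."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def gamma_from_eps(eps):
    """gamma = ln((1 - eps) / eps), positive when eps < 0.5."""
    return math.log((1.0 - eps) / eps)

def v_eff(x, beta_star, eps):
    """Assumed effective potential x^2 - (1/beta*) ln(1 + exp(beta* x^2 - gamma))."""
    gamma = gamma_from_eps(eps)
    return x * x - softplus(beta_star * x * x - gamma) / beta_star

beta_star, eps = 6.0, 0.1
plateau = gamma_from_eps(eps) / beta_star        # value approached for large |x|
break_point = math.sqrt(plateau)                 # where beta* x^2 = gamma
near_quadratic = v_eff(0.05, beta_star, 1e-12)   # eps -> 0 recovers x^2
far_value = v_eff(5.0, beta_star, eps)           # deep in the flat region
```

Under these assumptions the potential saturates at $\gamma/\beta^*$, and the
crossover between the quadratic and the flat regime sits where
$\beta^* x^2 = \gamma$.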

Figure 1: The effective potential for $\beta^* = 1$ and different values of $\epsilon$.

Figure 2: The effective potential for $\epsilon = 0.1$ and different values of $\beta^*$.

The standard regularization functional and the functional (9) admit a
simple physical interpretation. Let us consider for simplicity a function
defined on a one-dimensional lattice. The value of the function $f(x_i)$ at
site $i$ is regarded as the position of a particle that can move only in the
vertical direction. The particle is connected by a spring to a point that
corresponds to the data value $y_i$, and is also connected by springs to some neighboring
particles. The size of the neighborhood can vary, but the overall effect is such 
that the values of the function at neighboring sites tend to be the same. The 
particle is attracted, with a quadratic potential, by the data point, but it is 
also attracted by the neighboring particles: the configuration of the system 
will be the one that minimizes the total energy, depending on the trade-off
between these two different effects. The energy of the system corresponds
in this scheme to the standard regularization functional: the first term is
associated with the springs connecting the particle to the data point, and the
second term is associated with the springs connecting neighboring particles,
whose role is to enforce smoothness of the final configuration. The stabilizer
is represented by the relative strength and the extension of the connections 
of the particles at neighboring sites: a stabilizer of high degree corresponds 
to a system in which a particle at a site is connected to particles at sites very 
far away. 
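
The spring picture can be made concrete with a small sketch of the lattice
energy. It assumes the simplest possible stabilizer, a first difference, so
that only adjacent sites are coupled; the function name and the sample data
are hypothetical.

```python
def lattice_energy(f, y, lam):
    """Energy of a 1-D chain: data springs plus nearest-neighbor springs.

    Assumption: the stabilizer is a first difference, so only adjacent
    sites are coupled; higher-degree stabilizers would couple more sites.
    """
    data_term = sum((yi - fi) ** 2 for fi, yi in zip(f, y))
    smooth_term = sum((f[i + 1] - f[i]) ** 2 for i in range(len(f) - 1))
    return data_term + lam * smooth_term

y = [0.0, 0.1, 0.9, 1.0]          # data values, one per lattice site
flat = [0.5, 0.5, 0.5, 0.5]       # perfectly smooth, far from the data
exact = list(y)                   # on the data, less smooth
energy_flat = lattice_energy(flat, y, lam=1.0)
energy_exact = lattice_energy(exact, y, lam=1.0)
```

The flat configuration pays only the data term and the interpolating one pays
only the smoothness term; the minimizer lies between the two, depending on
`lam`.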

The functional (9) admits a similar interpretation, the only difference 
being the kind of springs that connect the function value to the data point: 
in this case the potential energy of these springs is not quadratic anymore, 
that is the force associated to each spring does not grow linearly with its 
elongation. The potential energy becomes constant when the elongation is 
larger than the threshold e, and the force (that is proportional to the first 
derivative of the potential energy) goes to zero. In a sense these springs break 
if we try to stretch them too much. 
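
A short sketch of the "breaking" behaviour: the force of the nonlinear spring
is taken here as the derivative of the assumed effective potential,
$2x / (1 + e^{\beta^* x^2 - \gamma})$, and compared with the force of an
ordinary linear spring. Parameter defaults and names are illustrative.

```python
import math

def spring_force(x, beta_star=6.0, eps=0.1):
    """Derivative of the assumed effective potential: 2x / (1 + exp(beta* x^2 - gamma))."""
    gamma = math.log((1.0 - eps) / eps)
    u = beta_star * x * x - gamma
    if u > 700.0:            # exp would overflow; the force is numerically zero
        return 0.0
    return 2.0 * x / (1.0 + math.exp(u))

def linear_force(x):
    """Ordinary spring: derivative of the quadratic potential x^2."""
    return 2.0 * x

small, large = 0.1, 3.0
force_small = spring_force(small)   # close to the linear regime
force_large = spring_force(large)   # spring has effectively broken
```

For small elongations the two forces are comparable, while for large
elongations the nonlinear force vanishes and the linear one keeps growing.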

3 Negative examples 

As we have seen in the previous section, standard regularization admits an
interpretation in terms of linear springs, whereas regularization in the
presence of unreliable data needs an interpretation in terms of nonlinear
springs, which break when the elongation is too large. Nonlinear springs have
also been used to deal with discontinuities (Geiger and Girosi, 1989, 1990;
Blake and Zisserman, 1987), and we show now another case in which they are
very useful.

In many situations, a further source of information about the function may
consist of knowing that its value at some given point has to be far from a
certain value (which, in this context, can be seen as a "negative example").
We shall account for the presence of negative examples by introducing a
quadratic repulsive term (a sort of "repulsive" spring) in the regularization
functional, one for each negative example (for a related trick, see Kass et
al., 1987). However the introduction of such a term might make the
regularization functional unbounded from below, because the repulsive spring
will tend to push the value of the function up to infinity. The simplest way
to prevent this occurrence is to allow the spring constant to decrease with
increasing elongation, or in the extreme case, to break at some point. We can
use the same model of nonlinear spring of the previous section, and just
reverse the sign of the associated potential (see figure (3)).

Figure 3: The potential associated with a repulsive spring.

If $\{(t_a, y_a) \in R^n \times R\}_{a=1}^K$ is the set of negative examples,
and if we define

$$\Delta_a = y_a - f(t_a),$$

the regularizing functional can be written as

$$H[f] = \sum_{i=1}^N V(\Delta_i) - \sum_{a=1}^K V_{eff}(\Delta_a) + \lambda \|Pf\|^2 .$$
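
The role of the breaking spring in keeping the functional bounded from below
can be illustrated with a sketch of the data part of the energy. It assumes
the same form of $V_{eff}$ as for unreliable data, omits the smoothness term,
and uses hypothetical function names; the single negative example is the one
used later in the experiments.

```python
import math

def v_quadratic(x):
    """Attractive spring potential V(x) = x^2."""
    return x * x

def v_eff(x, beta_star, eps):
    """Assumed breaking-spring potential (same form as for unreliable data)."""
    gamma = math.log((1.0 - eps) / eps)
    u = beta_star * x * x - gamma
    softplus = max(u, 0.0) + math.log1p(math.exp(-abs(u)))
    return x * x - softplus / beta_star

def data_energy(f, positives, negatives, beta_star=6.0, eps=0.1):
    """Attractive quadratic terms minus repulsive breaking-spring terms.

    `f` is any callable; the smoothness term lambda*||Pf||^2 is omitted here.
    """
    attract = sum(v_quadratic(y - f(x)) for x, y in positives)
    repel = sum(v_eff(y - f(t), beta_star, eps) for t, y in negatives)
    return attract - repel

negatives = [(-0.15, 0.99)]
on_top = data_energy(lambda x: 0.99, [], negatives)           # f hits the forbidden value
far_away = data_energy(lambda x: 0.99 + 10.0, [], negatives)  # spring has broken
lower_bound = -math.log(9.0) / 6.0     # -gamma / beta* per negative example
```

Because the assumed potential saturates, each negative example can lower the
energy by at most $\gamma/\beta^*$, so pushing the function to infinity gains
nothing further.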

4 Solution of the variational problem 

In this section we discuss the solution of the variational problem associated 
with the regularizing functionals of the previous sections. Since the cases 
of unreliable data and of negative examples are formally similar we will de- 
rive the equations only in the case of unreliable data. The functional to be 
minimized is 

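$$H[f] = \beta^* \sum_{i=1}^N V_{eff}(\Delta_i) + \lambda \|Pf\|^2, \tag{10}$$

and the Euler-Lagrange equations for this functional have the form:

$$\hat{P}P f(x) = \frac{\beta^*}{2\lambda} \sum_{i=1}^N V'_{eff}(\Delta_i)\, \delta(x - x_i) \tag{11}$$

where $V'_{eff}(x)$ is the first derivative of $V_{eff}(x)$, that is

$$V'_{eff}(x) = \frac{2x}{1 + e^{\beta^* x^2 - \gamma}} .$$

We notice that in the limit of $\epsilon \to 0$, that is in the case of
springs that never break, $\gamma$ goes to infinity and
$V'_{eff}(x) \to 2x$. In this case the standard regularization equations

$$\hat{P}P f(x) = \frac{\beta^*}{\lambda} \sum_{i=1}^N \Delta_i\, \delta(x - x_i) \tag{12}$$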

are recovered. Equation (11) has the same structure as that associated with
the standard regularization case, and the solution can be derived using the
Green's function technique (Stakgold, 1979). As in the standard
regularization case (Poggio and Girosi, 1989), the solution will be a linear
superposition of Green's functions, one for each data point:

$$f(x) = \sum_{i=1}^N c_i G(x; x_i) \tag{13}$$

where, in the general case,

$$c_i = \frac{\beta^*}{2\lambda} V'_{eff}(\Delta_i) .$$



We notice however that expression (13) is not the complete solution of 
the minimization problem. In fact all the functions that lie in the null space 
of the operator P are "invisible" to the smoothing term in the functional 
(10), so that the previous expansion is the solution modulo a term that lies 
in the null space of $P$. According to the considerations contained in section
1, in the following we will drop it from the equations.

In order to find the vector $c$ of coefficients $c_i$ we substitute the
expansion of equation (13) in the functional $H[f]$ defined in equation (10),
which becomes a function $H^*(c)$ of the coefficients. Thus the vector $c$
minimizes the function $H^*(c)$, which leads to the following set of
equations:

$$\frac{\partial H^*(c)}{\partial c_k} = 0, \qquad k = 1, \ldots, N \tag{14}$$

Gradient descent is probably the simplest approach for attempting to
find the solution to this minimization problem, though, of course, it is not
guaranteed to converge. Several other iterative methods, such as versions of
conjugate gradient and simulated annealing, may be more appropriate than
gradient descent, and their use is recommended. In the gradient descent
method the vector $c$ that minimizes $H^*(c)$ is regarded as the stable fixed
point of the following dynamical system:

$$\dot{c} = -\omega \frac{\partial H^*(c)}{\partial c} \tag{15}$$

where $\omega$ is a parameter determining the microscopic timescale of the
problem and is related to the rate of convergence to the fixed point.

We consider for simplicity the case of positive definite Green's functions,
which do not require any additional term in eq. (13). In this case it has been
shown (Poggio and Girosi, 1989) that, with natural boundary conditions, we
can write

$$\|Pf\|^2 = c \cdot Gc$$

where $G$ is the symmetric matrix $(G)_{ij} = G(x_i; x_j)$, its symmetry
coming from the fact that the operator $\hat{P}P$ is self-adjoint.
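Equations (15) have then the following form:

$$\dot{c} = -\omega \frac{\partial}{\partial c}\left[\beta^* \sum_{i=1}^N V_{eff}(\Delta_i) + \lambda\, c \cdot Gc\right]$$

that, defining

$$\sigma_i = \frac{V'_{eff}(\Delta_i)}{2\Delta_i}\, ; \qquad (\Sigma)_{ij} = \sigma_i \delta_{ij}$$

can be written as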

$$\dot{c} = -2\omega G\left[(\beta^* \Sigma G + \lambda I)c - \beta^* \Sigma y\right] \tag{16}$$

where $I$ is the identity matrix. The vector $c$ that minimizes $H^*(c)$ has
then to satisfy the following set of nonlinear equations:

$$(\beta^* \Sigma G + \lambda I)c = \beta^* \Sigma y, \tag{17}$$

the nonlinearity being contained in the matrix $\Sigma$, which is a nonlinear
function of the unknowns. Notice that

$$\lim_{\epsilon \to 0} \Sigma = I$$

and in this case the linear standard equations are recovered (Poggio and
Girosi, 1989). The main implication of the nonlinearity is that the solution of
these equations is not unique anymore, the different solutions corresponding
to the local minima of the functional (10). Notice that it is straightforward to
modify the previous gradient descent equations in order to take into account
negative examples.
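
A discretized version of the gradient descent of equation (16) can be
sketched as follows. The sketch assumes a Gaussian Green's function,
$\sigma_i = 1/(1 + e^{\beta^* \Delta_i^2 - \gamma})$, equally spaced sample
locations and a simple step-halving rule; all names and numerical choices are
illustrative and are not the original Common Lisp implementation.

```python
import math

def gauss_matrix(xs, width):
    """G_ij = G(x_i; x_j) for an assumed Gaussian Green's function."""
    return [[math.exp(-(a - b) ** 2 / (2.0 * width ** 2)) for b in xs] for a in xs]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def sigma(delta, beta_star, gamma):
    """sigma_i = V'_eff(Delta_i) / (2 Delta_i) = 1 / (1 + exp(beta* Delta_i^2 - gamma))."""
    u = min(beta_star * delta * delta - gamma, 700.0)   # clip to avoid overflow
    return 1.0 / (1.0 + math.exp(u))

def energy(c, g, y, beta_star, gamma, lam):
    """H*(c) = beta* sum_i V_eff(Delta_i) + lam c.Gc, with Delta = y - Gc."""
    gc = matvec(g, c)
    total = lam * sum(ci * gi for ci, gi in zip(c, gc))
    for yi, fi in zip(y, gc):
        u = beta_star * (yi - fi) ** 2 - gamma
        total += beta_star * (yi - fi) ** 2 - (max(u, 0.0) + math.log1p(math.exp(-abs(u))))
    return total

def gradient(c, g, y, beta_star, gamma, lam):
    """dH*/dc = 2 G [lam c - beta* Sigma (y - Gc)], equivalent to equation (16)."""
    gc = matvec(g, c)
    inner = [lam * ci - beta_star * sigma(yi - fi, beta_star, gamma) * (yi - fi)
             for ci, yi, fi in zip(c, y, gc)]
    return [2.0 * v for v in matvec(g, inner)]

def descend(c, g, y, beta_star, gamma, lam, step=0.01, iters=200):
    """Gradient descent; the step is halved whenever it fails to lower H*."""
    e = energy(c, g, y, beta_star, gamma, lam)
    for _ in range(iters):
        grad = gradient(c, g, y, beta_star, gamma, lam)
        trial = [ci - step * gi for ci, gi in zip(c, grad)]
        e_trial = energy(trial, g, y, beta_star, gamma, lam)
        if e_trial <= e:
            c, e = trial, e_trial
        else:
            step *= 0.5
    return c, e

xs = [-0.9, -0.6, -0.3, 0.0, 0.3, 0.6, 0.9]      # assumed sample locations
ys = [math.cos(x) for x in xs]
ys[3] = 1.5                                       # fourth point replaced by an outlier
g = gauss_matrix(xs, 0.3)
gamma = math.log(0.9 / 0.1)
c0 = [0.0] * len(xs)
c_fit, e_fit = descend(c0, g, ys, 6.0, gamma, 1e-2)
```

Since the problem is nonlinear, the coefficients returned depend on the
starting vector `c0`; different starting points may end in different local
minima.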

5 Experimental Results 

In this section we describe some results that we obtained by applying these
techniques to very simple one-dimensional problems. We first discuss an
example of unreliable data, and then a problem with negative examples. We
used a gradient descent algorithm with adaptive step, running on a SUN 4
workstation. The code for these simulations was written in Common
Lisp, and in all the examples that we describe below, the
time required for 100 iterations of the gradient descent algorithm was about
30 seconds. In the following figures data points are represented by large dots.

5.1 Unreliable data 

We approximate the function $f(x) = \cos(x)$ in the interval $[-1, 1]$. The
data set consisted of seven examples, randomly chosen from the graph of $f$.
In order to create an outlier in the data set, we substituted the value of the
fourth point with the value 1.5, that is 50% larger than the largest value of
the other data points. The Green's function of the problem was a Gaussian of
variance $\sigma = 0.3$, the parameter $\epsilon$ was set to 0.1, and the
parameter $\beta^*$ was set to 6. With these values of $\epsilon$ and
$\beta^*$ the effective potential was approximately constant for values of its
argument larger than 1. In figure (4a) we show the result that is obtained by
applying standard regularization theory to approximate the data set. The value
of the regularization parameter $\lambda$ is $10^{-2}$, and the result
obtained after 200 iterations of the gradient descent algorithm is shown. The
solution, which almost interpolates the data set, hardly resembles a cosine
function, due to the outlier. If the springs are allowed to break, we obtain
the result shown in figure (4b), after only 10 iterations of gradient descent:
the spring of the outlier breaks, and the solution is not influenced by the
outlier. Since the variance of the Gaussian Green's function is small
($\sigma = 0.3$) the solution has a "hole" in correspondence of the outlier,
because there are no data there. A similar situation is shown in figures (4c)
and (4d), the only difference being the value of the regularization parameter,
which is ten times larger, that is $\lambda = 10^{-1}$. We notice that, since
the Green's function is bounded, increasing the regularization parameter has
the effect of decreasing the norm of the solution (see Poggio and Girosi,
1989). This effect is evident when comparing figures (4a) and (4b) with
figures (4c) and (4d).
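
For comparison, the standard regularization baseline is the limit in which
$\Sigma = I$, so equation (17) becomes the linear system
$(\beta^* G + \lambda I)c = \beta^* y$ and can be solved directly rather than
by gradient descent. The sketch below does so with plain Gaussian
elimination; the equally spaced sample locations are an assumption (the
original points were chosen at random) and the helper names are hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

xs = [-0.9, -0.6, -0.3, 0.0, 0.3, 0.6, 0.9]    # assumed sample locations
ys = [math.cos(x) for x in xs]
ys[3] = 1.5                                     # the outlier
beta_star, lam, width = 6.0, 1e-2, 0.3
n = len(xs)
g = [[math.exp(-(p - q) ** 2 / (2.0 * width ** 2)) for q in xs] for p in xs]
# Standard regularization: (beta* G + lam I) c = beta* y  (Sigma = I).
a = [[beta_star * g[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
c = solve(a, [beta_star * yi for yi in ys])
fit_at_outlier = sum(ci * gi for ci, gi in zip(c, g[3]))
```

With a small regularization parameter the fitted value at the outlier stays
close to 1.5, which is the near-interpolating behaviour described for figure
(4a).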

5.2 Negative examples 

In order to test the negative example technique we choose again, as a
function to be approximated, the cosine function $f(x) = \cos(x)$, randomly
sampled at seven points in the interval $[-1, 1]$. In all the experiments the
regularization parameter was set to zero, its role not being crucial in this
case. The Green's functions we used were always Gaussians, with variance
different from case to case. The fourth data point, whose coordinates were
$(x, y) = (-0.15, 0.99)$, was selected as the negative example, and the
parameters $\beta^*$ and $\epsilon$ were the same as in the previous case, so
that the springs could break if the elongation were larger than 1. This meant
that the result had to be a function $f^*(x)$ that approximates the six
"positive" examples, but such that $|f^*(-0.15) - 0.99| > 1$. There are
clearly two possibilities: $(f^*(-0.15) - 0.99) > 1$ and
$(0.99 - f^*(-0.15)) > 1$, corresponding to functions "passing above and below
the negative example". These configurations correspond to two different
minima of the functional, and we expect to obtain one of these two
configurations depending on the initial conditions of the gradient descent
algorithm.

Figure 4: Approximation in presence of an outlier (the data point whose
value is 1.5). Comparisons between standard regularization ((a) and (c)) and
the extension introduced here ((b) and (d)). See text for explanation.

Figure 5: Negative examples. (a) and (b): the configurations corresponding
to different minima of the functional. (c): the effect of increasing the
attractiveness of the standard springs. See text for explanation.

In figures (5a) and (5b) we show two results corresponding to two different
local minima. Convergence was reached in 50 iterations, and in both cases
the variance of the Gaussian is $\sigma = 0.2$. In figure (5a) we set as
initial condition $c_i = y_i$, and in figure (5b) we set $c_i = 0.0$. In the
first case the initial condition corresponds to a function that is "above" the
data, while in the second case the initial function is zero everywhere, and
thus "below"
the data. In the first case the final value of the "energy" of the system was 
H = -0.996, that is very close to the global minimum energy H = -1.0, 
while in the second case the energy was H = -0.931. Interpreting these 
results in terms of springs, it is evident, in figure (5b), that the spring on the 
left of the negative example is not sufficiently strong to pull up the solution 
to the datum. We then changed the elastic constant of this spring and of the 
corresponding one on the right of the negative example, setting their values to 
10, that is ten times larger than the other ones. The result is shown in figure 
(5c), and it is clearly better than the one shown in figure (5b), its associated 
energy being H = -0.995, that is comparable with the value H = -0.996 of 
figure (5a). 

From the previous result and many other experiments it is apparent that 
the energy landscape associated with this minimization problem could be 
very complicated, with many local minima corresponding to the two types 
of configurations ("above" and "below"). It is natural to ask whether during 
the gradient descent iterations the system naturally "jumps" from one of 
these configurations to the other one. The answer is given in figures (6a) and 
(6b). In figure (6a) we show the configuration of the system corresponding
to the iterations 1, 30, 31 and 40 of the gradient descent algorithm. The
variance of the Gaussian Green's function is $\sigma = 0.8$, and the starting
point of the descent procedure is $c_i = 0.0$. At the beginning the
configuration is of type "below", because it is identically zero, and then it
stabilizes around an interpolating function until iteration 30. At iteration
30 the system jumps to a configuration of type "above", whose energy is much
lower, and then converges rapidly to a local minimum. In figure (6b) the
energy of the system is shown as a function of the number of iterations:
notice the jump at iteration 30, which probably corresponds to a discontinuity
of the gradient of the energy surface.

In order to escape local minima we used a simple form of stochastic
gradient descent, adding a white noise term to eq. (15). The noise term
was used only to get out of local minima, that is it was switched on only
when the energy decreased, from one iteration to the next one, by an amount
smaller than a small threshold (usually $10^{-8}$). The usefulness of the
noise is shown in figures (7a) and (7b). The data are the same as in figures
(5) and (6), but the break point of the spring of the negative example has
value 1, the variance of the Gaussian Green's function is $\sigma = 0.4$ and
the amplitude of the noise is $10^{-2}$. In figure (7a) we show the result of
the gradient descent algorithm without noise. Convergence was obtained after
25 iterations, and the result is not very good, corresponding to some local
minimum. In figure (7b) we show the result of the stochastic gradient descent
algorithm after 1000 iterations: the local minima have been escaped, and the
result is almost a perfect interpolant on the "positive" examples.
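
The switching rule for the noise can be sketched as below. The sketch assumes
that "stalled" means the last iteration lowered the energy by a non-negative
amount smaller than the threshold, uses Gaussian white noise, and exercises
the rule on a toy quadratic energy; the function names and the toy problem
are hypothetical and are not the functional of the experiments.

```python
import random

def stalled_noise_descent(energy, grad, c, step, noise_amp, stall_tol, iters, seed=0):
    """Gradient descent that adds white noise only after the energy stalls.

    Assumption: "stalled" means the last iteration lowered the energy by
    a non-negative amount smaller than stall_tol.
    """
    rng = random.Random(seed)
    e_prev = energy(c)
    stalled = False
    injections = 0
    for _ in range(iters):
        g = grad(c)
        c = [ci - step * gi for ci, gi in zip(c, g)]
        if stalled:
            c = [ci + noise_amp * rng.gauss(0.0, 1.0) for ci in c]
            injections += 1
        e_new = energy(c)
        stalled = 0.0 <= e_prev - e_new < stall_tol
        e_prev = e_new
    return c, injections

# Toy quadratic energy, used only to exercise the switching rule.
def toy_energy(c):
    return sum(ci * ci for ci in c)

def toy_grad(c):
    return [2.0 * ci for ci in c]

c_short, n_short = stalled_noise_descent(toy_energy, toy_grad, [1.0], 0.25, 1e-2, 1e-8, 5)
c_long, n_long = stalled_noise_descent(toy_energy, toy_grad, [1.0], 0.25, 1e-2, 1e-8, 200)
```

In the short run the energy is still dropping quickly, so no noise is
injected; in the long run the descent stalls and the noise term is switched
on.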

Interesting effects take place if we raise the amplitude of the noise. In
figures (8a), (8b) and (8c) we show what happens if, in the previous example,
we set the amplitude of the noise to $10^{-1}$, instead of $10^{-2}$. The
results of the stochastic descent procedure are shown at iterations 200, 500
and 2000. We notice that the system jumps from a configuration of type "below"
to a configuration of type "above" and then to a configuration of type "below"
again. This suggests that there are several local minima, and the noise makes
the system jump from one to another. In figure (8d) the energy of the
system as a function of the number of iterations is shown: notice that the
algorithm does not inject the noise continuously, but only when the energy
stops decreasing.

Figure 6: Negative examples. (a) Several configurations of the system are
shown, while equilibrium is being approached. (b) The learning curve.

Figure 7: Negative examples. Results obtained without (a) and with (b) noise.

6 Remarks

1. The first extension we have introduced here - to deal with unreliable
data - may be important in problems such as surface reconstruction,
as encountered in computer vision. It may or may not be useful
in problems of learning from examples.

2. The second extension - to exploit negative examples - is especially 
interesting for the problem of learning, where often negative examples 
are present (though they usually are less important than the positive 
ones). In some cases it may be useful also in problems of approximation 
of functions. There are situations in which one knows that certain 
regions of the range of the function are forbidden. Interestingly, this
type of problem seems to have been ignored in the classical approach to
function approximation (see Verri and Poggio, 1988 for related, simpler
and more classical cases). The functional we considered, and hence the
type of spring we used, is amenable to further modifications, according to
the a priori knowledge about the system. For example, the constraint
that the values of a one dimensional function are bounded from above
(and/or below) can be included using springs that are negative from
one side and positive from the other side.

3. In both the extensions that we have presented the solution has the
form (13), which has a simple interpretation in terms of feedforward
networks with one layer of hidden units, of the same class as the
regularization networks introduced in previous papers (Poggio and Girosi,
1989; 1990).

Acknowledgements We thank Cesare Furlanello for useful discussions and 
for a critical reading of the manuscript. 